3. Analysis and Results
3.1 Dataset Description
For sentiment analysis, this project uses a dataset of movie reviews from IMDB (Maas et al. 2011). The dataset includes 25,000 movie reviews for training and another 25,000 for testing. Since some movies are reviewed more often than others, the dataset includes a maximum of 30 reviews for any particular movie. The dataset includes only the top 5,000 most frequent words; however, the top 50 most frequent words are also discarded, as they are unlikely to contribute much to sentiment context. IMDB reviews include a star rating of 1 to 10 stars, and these ratings have been converted to a binary 0/1 value for use as the sentiment classification label in the dataset.
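As a rough illustration of the label conversion, the sketch below assumes the scheme described by Maas et al. (2011), in which ratings of 4 stars or fewer are negative, ratings of 7 stars or more are positive, and neutral reviews are left out. The function name `rating_to_label` is hypothetical and is not part of any package used in this project.

```python
def rating_to_label(stars):
    """Map a 1-10 star rating to a sentiment label.

    Assumed thresholds (Maas et al. 2011): <= 4 is negative (0),
    >= 7 is positive (1), anything in between is excluded (None).
    """
    if stars <= 4:
        return 0
    if stars >= 7:
        return 1
    return None  # neutral ratings are not part of the labelled data


# Show the label for every possible star rating.
for stars in range(1, 11):
    print(stars, rating_to_label(stars))
```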
The TensorFlow package (Abadi et al. 2015) includes this dataset in a vectorized format, which is ideal for use in neural networks. The vectorization process starts by assigning each word that appears in the vocabulary (i.e. all the unique words in the dataset) a unique number as a numeric substitution value. Each word in the original text observation is then replaced with the substitution number assigned to that word. Every observation is translated in this same way, converting it from a string made up of words to a vector of integers. An example follows below:
Observation #1: “this is fun”
Observation #2: “fun times ahead”
Observation #3: “fun is ahead of times”
Based on the three observations above, there are six unique words in the vocabulary. Each of these six unique words is assigned a numeric value, so the vocabulary list is [1-this, 2-is, 3-fun, 4-times, 5-ahead, 6-of]. To vectorize the strings, the numeric values for each of those words are added to an integer vector. The vectorized observations are shown below.
Observation #1 vectorized: [1,2,3]
Observation #2 vectorized: [3,4,5]
Observation #3 vectorized: [3,2,5,6,4]
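The worked example above can be reproduced with a few lines of Python. This is only a minimal sketch: the helper names `build_vocabulary` and `vectorize` are hypothetical, words are split on whitespace, and numbers are assigned in order of first appearance, which is an assumption chosen to match the example rather than a description of how the packaged dataset was built.

```python
def build_vocabulary(observations):
    """Assign each unique word an integer, starting at 1, in order of first appearance."""
    vocabulary = {}
    for text in observations:
        for word in text.split():
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary) + 1
    return vocabulary


def vectorize(text, vocabulary):
    """Replace every word in the text with its assigned integer."""
    return [vocabulary[word] for word in text.split()]


observations = ["this is fun", "fun times ahead", "fun is ahead of times"]
vocabulary = build_vocabulary(observations)
vectors = [vectorize(text, vocabulary) for text in observations]

print(vocabulary)
print(vectors)
```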
Install and Import Packages
Loading and Preparing Training and Test Data
Load the TensorFlow IMDB review dataset. Only the 5,000 most common words are included; all other words are replaced with a token representing an unknown word. Up to the first 500 words of each review are included in the training and test sets.
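The vocabulary cap can be illustrated with a small sketch. It assumes that word ids are frequency ranks (1 is the most common word) and uses a hypothetical `UNKNOWN_ID` of 0 for the unknown-word token; the actual ids reserved by the packaged dataset may differ.

```python
UNKNOWN_ID = 0  # hypothetical id reserved for "unknown word"


def cap_vocabulary(ranked_ids, max_words, unknown_id=UNKNOWN_ID):
    """Keep ids 1..max_words (1 = most frequent word); replace all others."""
    return [i if 1 <= i <= max_words else unknown_id for i in ranked_ids]


# A made-up review expressed as frequency ranks, for illustration only.
review = [12, 4999, 5000, 5001, 87000, 3]
capped = cap_vocabulary(review, max_words=5000)
print(capped)
```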
The neural network expects batches of training examples that are 500 words long, so any observation shorter than 500 words is padded with a padding token to reach the required 500-word length.
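A minimal sketch of the truncate-and-pad step follows. The helper `pad_or_truncate` and the `PAD_ID` value of 0 are hypothetical, and padding at the end of the sequence is an assumption; library utilities may pad or truncate at the front instead.

```python
PAD_ID = 0  # hypothetical id reserved for the padding token


def pad_or_truncate(sequence, length, pad_id=PAD_ID):
    """Keep at most the first `length` ids, then pad at the end up to `length`."""
    kept = sequence[:length]
    return kept + [pad_id] * (length - len(kept))


short_review = [5, 8, 13]            # shorter than the target length
long_review = list(range(1, 801))    # 800 ids, longer than the target length

padded_short = pad_or_truncate(short_review, 500)
padded_long = pad_or_truncate(long_review, 500)
print(len(padded_short), len(padded_long))
```

Whichever side is padded, every observation ends up with exactly 500 entries, which is what allows reviews to be stacked into fixed-size batches.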
Statistical Modeling
In this section a neural network model is implemented using the keras package to learn embeddings for each of the words in the vocabulary. The goal is to learn multi-dimensional vectors in which similar words are close together in the vector space, where similar means having a similar contextual meaning with regard to the training dataset and its sentiment classification. For example, “gem” and “favorite” would be highly similar in the context of a movie review, whereas in a general context they would not be so similar.
The number of dimensions of the output embedding will be varied and tested as part of the modeling process. The number of dimensions is an important hyperparameter since it will control how much compression of the training set occurs. A small number of dimensions results in a higher amount of compression, whereas a large number of dimensions allows for more detail to be captured by the embeddings. However, a larger number of dimensions can also lead to overfitting (Yin and Shen 2018).
Next, the embedding model is trained. Chollet and Allaire (2018) and Monroe (n.d.) were important resources in coding the embedding training. As a first step, the model is trained repeatedly with a different number of dimensions each time. Models are trained using 2 to 7 dimensions, and the testing accuracy is recorded for each model.
The neural network model uses an embedding layer that, once trained, converts the words in the vocabulary to multi-dimensional embedding vectors. The number of inputs to the embedding layer is 5,000, which corresponds to the number of words in the vocabulary. The number of outputs selected for the embedding layer is the dimensionality of the embedding vector. As mentioned previously, this dimensionality is varied to test the performance of the embeddings across different embedding sizes. A second layer in the neural network model flattens the 3-dimensional tensor output from the embedding layer to a 2-dimensional tensor. Finally, a dense layer connects every output from the flatten layer to the final output layer. The model is trained using backpropagation to predict the sentiment classification variable, and the final trained weights of the embedding layer are the embeddings for each corresponding word in the vocabulary.
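To make the layer sizes concrete, the sketch below counts the trainable weights implied by this architecture for each embedding size. It assumes a vocabulary of 5,000 words, sequences of 500 words, and a dense output layer with a single unit plus a bias term; the single-unit output is an assumption based on the binary label, and the function name `layer_sizes` is hypothetical.

```python
def layer_sizes(vocab_size, sequence_length, embedding_dim):
    """Return the weight counts for an embedding -> flatten -> dense(1) model."""
    embedding_weights = vocab_size * embedding_dim      # one vector per word
    flattened_units = sequence_length * embedding_dim   # per review, after flatten
    dense_weights = flattened_units + 1                  # one weight per unit + bias
    return {
        "embedding_weights": embedding_weights,
        "flattened_units": flattened_units,
        "dense_weights": dense_weights,
        "total_weights": embedding_weights + dense_weights,
    }


# Compare the model size across the embedding dimensions tested here.
for dim in range(2, 8):
    print(dim, layer_sizes(5000, 500, dim))
```

Under these assumptions the embedding layer holds most of the weights, and the total grows linearly with the number of dimensions.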
782/782 - 0s - 588us/step - acc: 0.8792 - loss: 0.2854
782/782 - 0s - 608us/step - acc: 0.8782 - loss: 0.2879
782/782 - 0s - 573us/step - acc: 0.8767 - loss: 0.2940
782/782 - 0s - 579us/step - acc: 0.8761 - loss: 0.2967
782/782 - 0s - 573us/step - acc: 0.8767 - loss: 0.2974
782/782 - 1s - 668us/step - acc: 0.8743 - loss: 0.3038
Embedding dimensions: 2 3 4 5 6 7
Test accuracy: 0.87924 0.87820 0.87672 0.87608 0.87672 0.87432
Surprisingly, an embedding of just 2 dimensions had the best accuracy. That may be the highest accuracy in predicting the binary sentiment classification, but the question is whether 2 dimensions over-compresses the data and fails to represent the higher-order patterns we hope the embedding will model. As Yin and Shen (2018) point out, “the impact of dimensionality on word embedding has not yet been fully understood…a word embedding with a small dimensionality is typically not expressive enough to capture all possible word relations, whereas one with a very large dimensionality suffers from over-fitting.”
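The comparison can be reproduced from the test accuracies reported above. The short sketch below simply picks the dimension with the highest accuracy and measures the gap between the best and worst models; the variable names are illustrative.

```python
# Test accuracy by embedding dimension, copied from the results above.
test_accuracy = {
    2: 0.87924,
    3: 0.87820,
    4: 0.87672,
    5: 0.87608,
    6: 0.87672,
    7: 0.87432,
}

best_dim = max(test_accuracy, key=test_accuracy.get)
spread = max(test_accuracy.values()) - min(test_accuracy.values())

print("best dimension:", best_dim)
print("accuracy spread:", round(spread, 5))
```

The gap between the best and worst models is under half a percentage point, so accuracy alone separates these embedding sizes only weakly.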
The model with just 2 dimensions is tested to see how well it does on finding similar words, where similar is in the context of the sentiment of a movie review.
782/782 - 0s - 590us/step - acc: 0.8805 - loss: 0.2855
The words “awful”, “mediocre”, “perfect” and “favorite” are examples of positive and negative words that could be found in a movie review. These test words are used to qualitatively test the embedding model by examining which words are found to be close to them.
awful lame alcoholic sadly relevant are
1.0000000 1.0000000 1.0000000 0.9999998 0.9999998 0.9999998
mediocre effort turkey terrible stereotype repeat
1.0000000 1.0000000 1.0000000 0.9999999 0.9999999 0.9999998
perfect lovers sing manager bath donald
1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.9999999
favorite deeply round marie polanski poetry
1.0000000 1.0000000 1.0000000 1.0000000 0.9999999 0.9999999
Some related words are found, but other words do not seem to be closely related. Overall the results are poor, so it seems that embeddings using only 2 dimensions are not adequate despite the high accuracy found on the test set.
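The similar-word lookup can be sketched as follows. The similarity measure is not named in this report, so cosine similarity is an assumption, chosen because it is bounded by 1 like the scores shown above. The toy vectors are hypothetical and are not taken from the trained model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, embeddings, top_n=3):
    """Rank all words (including the query word) by similarity to `word`."""
    target = embeddings[word]
    scores = {other: cosine_similarity(target, vec) for other, vec in embeddings.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]


# Hypothetical 2-dimensional embeddings, for illustration only.
toy_embeddings = {
    "awful": [-2.0, 1.0],
    "lame": [-1.0, 0.5],     # same direction as "awful", different length
    "plot": [1.0, 2.0],      # perpendicular to "awful"
    "perfect": [2.0, -1.0],  # opposite direction to "awful"
}

print(most_similar("awful", toy_embeddings, top_n=2))
```

Cosine similarity ignores vector length, so in 2 dimensions any two words that point in roughly the same direction score close to 1 regardless of how far apart they sit. This may help explain why so many unrelated words tie at or near 1.0 in the 2-dimension results.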
Yin and Shen (2018) state that selecting the number of dimensions is often done ad hoc or by using grid search, with a common method being to train embeddings of different dimensions and evaluate the models using a functionality test such as word analogy. A similar method was used here on a smaller scale: the embedding model was retrained using a larger number of dimensions and the performance on the related-words test was compared. There was insufficient time to test many model variations, but it was important to test a larger number of dimensions for comparison with the 2-dimension model. The test accuracy for the model previously trained with 7 dimensions was fairly close to the accuracy for 2 dimensions, so that embedding length was tested next.
782/782 - 0s - 571us/step - acc: 0.8736 - loss: 0.3002
Here are the similar words for the same positive and negative words that were previously tested, but now tested using the new embedding model with higher dimensionality:
awful ultimately painful sorry fake nowhere
1.0000000 0.9979454 0.9972085 0.9959082 0.9958690 0.9951919
mediocre teeth incompetent main disappointing generous
1.0000000 0.9969511 0.9960854 0.9958095 0.9956890 0.9951328
perfect great seeking freedom tremendous excellent
1.0000000 0.9983013 0.9981593 0.9978434 0.9972070 0.9955213
favorite paulie excellent necessary great seeking
1.0000000 0.9986090 0.9984505 0.9983542 0.9971755 0.9971190
These results are better than those of the 2-dimension model, so it seems that test accuracy alone is not a good method for determining how many dimensions should be included in the embedding model.
Next, the number of epochs used in training will be evaluated to see how that impacts the model performance.
782/782 - 0s - 574us/step - acc: 0.8665 - loss: 0.3540
The keras graph of the accuracy on the training data (blue) vs. the testing data (green) shows that the testing accuracy starts to flatten at epoch 6, so 6 epochs appears to be sufficient. This is the number of epochs previously used in training, so the best model remains 7 dimensions trained with 6 epochs.
Data and Visualization
Conclusion
The development and analysis of the word embedding model for classifying IMDB movie reviews demonstrated promising results. The preferred number of embedding dimensions was identified as 7, achieving an accuracy of 87.34% on the test dataset. This was determined by comparing test accuracy across models with 2 to 7 dimensions and by comparing the similar-word results of the 2-dimension and 7-dimension models, which showed that 7 dimensions gave competitive accuracy and more coherent similar words. The model’s performance is noteworthy given the constraints of training on only the 5,000 most common words, minimal data preprocessing, and limiting input sequences to the first 500 words of each review. These factors illustrate the model’s robustness and effectiveness in capturing the semantic relationships within the data.
Furthermore, the embedding similarity results showed that the model could meaningfully capture semantic relationships, as evidenced by the coherent and relevant similar words found for terms like “awful,” “mediocre,” “perfect,” and “favorite.” The final training session, capped at 10 epochs, ensured the model did not overfit, maintaining its accuracy and reliability. Overall, the model’s strong performance under constrained conditions highlights its potential for practical applications in sentiment analysis, offering an efficient and effective solution for understanding and categorizing movie reviews.
Abadi, Martín, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, et al. 2015. “TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems.” https://www.tensorflow.org/.
Camacho-Collados, Jose, and Mohammad Taher Pilehvar. 2020. “Embeddings in Natural Language Processing.” In Proceedings of the 28th International Conference on Computational Linguistics: Tutorial Abstracts, edited by Lucia Specia and Daniel Beck, 10–15. Barcelona, Spain (Online): International Committee for Computational Linguistics. https://doi.org/10.18653/v1/2020.coling-tutorials.2.
Chollet, Francois, and J. J. Allaire. 2018. Deep Learning with R. 1st ed. USA: Manning Publications Co.
Hasan, Md. Rakibul, Maisha Maliha, and M. Arifuzzaman. 2019. “Sentiment Analysis with NLP on Twitter Data.” In 2019 International Conference on Computer, Communication, Chemical, Materials and Electronic Engineering (IC4ME2), 1–4. https://doi.org/10.1109/IC4ME247184.2019.9036670.
Kasri, Mohammed, Marouane Birjali, Mohamed Nabil, Abderrahim Beni-Hssane, Anas El-Ansari, and Mohamed El Fissaoui. 2022. “Refining Word Embeddings with Sentiment Information for Sentiment Analysis.” Journal of ICT Standardization 10 (3): 353–82. https://doi.org/10.13052/jicts2245-800X.1031.
Kathuria, Priyanshi, Parth Sethi, and Rithwick Negi. 2022. “Sentiment Analysis on e-Commerce Reviews and Ratings Using ML & NLP Models to Understand Consumer Behavior.” In 2022 International Conference on Recent Trends in Microelectronics, Automation, Computing and Communications Systems (ICMACC), 1–5. https://doi.org/10.1109/ICMACC54824.2022.10093674.
Kiros, Ryan, Yukun Zhu, Russ R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. “Skip-Thought Vectors.” Advances in Neural Information Processing Systems 28.
Liang, Bin, Rongdi Yin, Jiachen Du, Lin Gui, Yulan He, Min Yang, and Ruifeng Xu. 2023. “Embedding Refinement Framework for Targeted Aspect-Based Sentiment Analysis.” IEEE Transactions on Affective Computing 14 (1): 279–93. https://doi.org/10.1109/TAFFC.2021.3071388.
Maas, Andrew L., Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. “Learning Word Vectors for Sentiment Analysis.” In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 142–50. Portland, Oregon, USA: Association for Computational Linguistics. http://www.aclweb.org/anthology/P11-1015.
Medhat, Walaa, Ahmed Hassan, and Hoda Korashy. 2014. “Sentiment Analysis Algorithms and Applications: A Survey.” Ain Shams Engineering Journal 5 (4): 1093–113. https://doi.org/10.1016/j.asej.2014.04.011.
Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” https://arxiv.org/abs/1301.3781.
Monroe, Burt. n.d. “Materials for Classes: ‘Text as Data’ (PLSC 597) at Penn State & ‘Advanced Text as Data: Natural Language Processing’ (2p) at Essex Summer School in Social Science Data Analysis.” TextAsDataCourse. https://burtmonroe.github.io/TextAsDataCourse/.
N, Lavanya B., Anitha Rathnam K. V, Kiran K, P. Deepa Shenoy, and Venugopal K. R. 2024. “Fusion of Deep Learning with Advanced and Traditional Embeddings in Sentiment Analysis.” In 2024 IEEE 9th International Conference for Convergence in Technology (I2CT), 1–6. https://doi.org/10.1109/I2CT61223.2024.10543279.
Nasukawa, Tetsuya, and Jeonghee Yi. 2003. “Sentiment Analysis: Capturing Favorability Using Natural Language Processing.” In Proceedings of the 2nd International Conference on Knowledge Capture, 70–77.
Pennington, Jeffrey, Richard Socher, and Christopher D Manning. 2014. “Glove: Global Vectors for Word Representation.” In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–43.
Pilehvar, Mohammad Taher, and Jose Camacho-Collados. 2020. Embeddings in Natural Language Processing: Theory and Advances in Vector Representations of Meaning. Morgan & Claypool Publishers.
Tang, Duyu, Furu Wei, Bing Qin, Nan Yang, Ting Liu, and Ming Zhou. 2016. “Sentiment Embeddings with Applications to Sentiment Analysis.” IEEE Transactions on Knowledge and Data Engineering 28 (2): 496–509. https://doi.org/10.1109/TKDE.2015.2489653.
Yin, Zi, and Yuanyuan Shen. 2018. “On the Dimensionality of Word Embedding.” Advances in Neural Information Processing Systems 31.